Red Wine Quality Analysis by Sun Xiaobing

This project aims to find which chemical properties influence the quality of red wines. The data set was created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) in 2009.

Load the data

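# Load the packages used below (qplot/ggplot, mutate/select/group_by, gather)
library(ggplot2); library(dplyr); library(tidyr)
# Load the data; the first CSV column holds the row names
wineQualityReds<-read.csv("wineQualityReds.csv",na.strings="",row.names=1)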

Overview

dim(wineQualityReds)
## [1] 1599   12
str(wineQualityReds)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
names(wineQualityReds)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

Let’s check missing values.

apply(wineQualityReds,2,function(x){sum(is.na(x))})
##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0

No missing values here.
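The same per-column count can be written more compactly with colSums. The sketch below uses a small made-up data frame (`toy` is hypothetical and is not the wine data) so that the counts are easy to follow.

```r
# Hypothetical toy data frame with two missing values, for illustration only
toy <- data.frame(alcohol = c(9.4, NA, 9.8),
                  pH      = c(3.51, 3.20, NA),
                  quality = c(5L, 5L, 6L))

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# The apply() form used above gives the same counts
na_apply <- apply(toy, 2, function(x) sum(is.na(x)))
```

Both forms return one count per column, so either can be used for the check above.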

Univariate Plots Section

quality

quality<-wineQualityReds$quality
summary(quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
pie(table(quality),main="Pie of Red Wine Qualities")

out<-boxplot.stats(quality)$out
out
##  [1] 8 8 8 8 8 3 8 8 8 3 8 3 8 3 3 8 8 8 8 8 3 3 8 8 3 3 3 8

Most red wines have a quality between 5 and 7, and the average quality is 5.64. The best wines in this dataset are rated 8 and the worst are rated 3. boxplot.stats flags both groups as outliers, but we keep them for further analysis.
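To see why the scores 3 and 8 are flagged, it helps to reproduce the rule that boxplot.stats applies with its default coef of 1.5: a value is an outlier if it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below uses a short made-up vector of scores (an assumption, not the full dataset) whose hinges are 5 and 6, the same values as the 1st and 3rd quartiles reported by summary(quality).

```r
# Small hypothetical sample of quality scores (not the full dataset)
x <- c(3, 5, 5, 5, 6, 6, 6, 7, 8)

# fivenum() returns min, lower hinge, median, upper hinge, max
h <- fivenum(x)
iqr <- h[4] - h[2]            # spread between the hinges
lower <- h[2] - 1.5 * iqr     # lower fence
upper <- h[4] + 1.5 * iqr     # upper fence

# Values outside the fences are the ones boxplot.stats() reports in $out
manual_out <- x[x < lower | x > upper]
print(manual_out)
```

With hinges of 5 and 6 the fences sit at 3.5 and 7.5, so only the scores 3 and 8 fall outside them.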

fixed.acidity

fixed.acidity<-wineQualityReds$fixed.acidity
summary(fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
qplot(fixed.acidity,binwidth=0.2)

Most values concentrate between 6 and 11. The data is right skewed. Let’s look at the density.

qplot((fixed.acidity),data=wineQualityReds,geom="density",color=factor(quality))

It seems the distributions for quality 7 and 8 are different; perhaps that’s the reason for the skew.

qplot(factor(quality),fixed.acidity,data=wineQualityReds,geom="boxplot")

The medians do not show a clear tendency.

volatile.acidity

volatile.acidity<-wineQualityReds$volatile.acidity
summary(volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
qplot(volatile.acidity,bins=30)

qplot(volatile.acidity,geom="density")

qplot(factor(quality),volatile.acidity,data=wineQualityReds,geom="boxplot")

In terms of medians, volatile.acidity seems to have a negative impact on quality. Most values concentrate between 0.25 and 0.8.

citric.acid

citric.acid<-wineQualityReds$citric.acid
summary(wineQualityReds$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
qplot(citric.acid,bins=30)

qplot(citric.acid,geom='density')

The data is right skewed, and the density has many peaks.

qplot(citric.acid,color=factor(quality),geom='density')

The distribution varies with quality.

qplot(factor(quality),citric.acid,data=wineQualityReds,geom="boxplot")

sum(wineQualityReds$citric.acid==0)
## [1] 132

It seems citric acid has a positive influence on the quality of red wines. Note: 132 citric.acid values are 0; perhaps the amounts are too small to detect.

residual.sugar

residual.sugar<-wineQualityReds$residual.sugar
summary(residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
qplot(residual.sugar,bins=30)

The data has a long tail, perhaps due to outliers. Let’s count the outliers and restrict the x-axis to the range 1 to 5.

out<-boxplot.stats(wineQualityReds$residual.sugar)$out  # outliers
length(out)  # number of outliers
## [1] 155
qplot(residual.sugar,binwidth=0.1) + scale_x_continuous(limits=c(1,5))

Now it seems more symmetric.

qplot(x=factor(quality),y=residual.sugar,geom="boxplot",ylim=c(1,4))

There are many outliers within each quality level, and there seems to be no clear tendency between residual.sugar and quality.

chlorides

chlorides<-wineQualityReds$chlorides
summary(chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
qplot(chlorides,bins=30)

The data has a long tail, obviously due to outliers.

qplot(chlorides[chlorides<0.15],bins=30)

After removing the outliers, the distribution tends to be symmetric.

qplot(factor(quality),chlorides,data=wineQualityReds,geom="boxplot")

qplot(factor(quality),chlorides,data=wineQualityReds,geom="boxplot",
      ylim=c(0,0.2))
## Warning: Removed 41 rows containing non-finite values (stat_boxplot).

There are also many chlorides outliers within quality 5 and 6. There’s no clear tendency.

free.sulfur.dioxide

free.sulfur.dioxide<-wineQualityReds$free.sulfur.dioxide
summary(free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
qplot(free.sulfur.dioxide,bins=30)

The data is a little right skewed.

out<-boxplot.stats(wineQualityReds$free.sulfur.dioxide)$out  # outliers
length(out)
## [1] 30
qplot(log(free.sulfur.dioxide),bins=25)

qplot(factor(quality),free.sulfur.dioxide,data=wineQualityReds,geom="boxplot")

Log transformation does not make it better, and there’s no clear tendency between quality and free sulfur dioxide.

qplot(free.sulfur.dioxide,color=factor(quality),geom="density")

total.sulfur.dioxide

total.sulfur.dioxide<-wineQualityReds$total.sulfur.dioxide
summary(total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
qplot(total.sulfur.dioxide,bins=30)

The data is also right skewed.

out<-boxplot.stats(total.sulfur.dioxide)$out
length(out)
## [1] 55
qplot(log(total.sulfur.dioxide),bins=30)

After log transformation, the data becomes roughly symmetric.
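The effect of the log transformation on skew can also be checked numerically rather than by eye. The sketch below defines a simple moment-based `skewness` helper (hypothetical, not part of the original analysis) and applies it to a simulated right-skewed sample; the log-normal draw is only a stand-in for a column such as total.sulfur.dioxide.

```r
# Moment-based sample skewness: third central moment over the cubed std. deviation
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
# Hypothetical right-skewed values; a log-normal sample stands in for the real column
x <- rlnorm(1000, meanlog = 3.5, sdlog = 0.7)

skew_raw <- skewness(x)       # clearly positive for a long right tail
skew_log <- skewness(log(x))  # close to zero once the tail is compressed
print(c(raw = skew_raw, log = skew_log))
```

Calling skewness() on the real column before and after log() would give a single number to compare alongside the two histograms.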

qplot(factor(quality),total.sulfur.dioxide,data=wineQualityReds,geom="boxplot")

No clear tendency. To analyze the bound form of sulfur dioxide, we create a new variable, bf.sulfur.dioxide, defined as total.sulfur.dioxide minus free.sulfur.dioxide.

wineQualityReds<-mutate(wineQualityReds, bf.sulfur.dioxide=total.sulfur.dioxide-free.sulfur.dioxide)
bf.sulfur.dioxide<-wineQualityReds$bf.sulfur.dioxide
qplot(bf.sulfur.dioxide,bins=30)

qplot(log(bf.sulfur.dioxide),bins=30)

Still right skewed. After log transformation, the data becomes roughly symmetric.

Density

density<-wineQualityReds$density
summary(density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040
qplot(density,bins=30)

var(density)
## [1] 3.562029e-06

The data looks symmetric. The values are centered around 0.9967, and the variance is small.

qplot(factor(quality),density,data=wineQualityReds,geom="boxplot")

Density seems to have a negative impact on quality.

pH

pH<-wineQualityReds$pH
summary(pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
qplot(pH,bins=30)

The data looks symmetric.

qplot(factor(quality),pH,data=wineQualityReds,geom="boxplot")

It seems pH has a negative impact on quality.

sulphates

sulphates<-wineQualityReds$sulphates
summary(sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
qplot(sulphates,bins=30)

The data looks a little right skewed. Let’s cut the tail.

qplot(sulphates[sulphates<1],bins=30)

Still right skewed.

qplot(factor(quality),sulphates,data=wineQualityReds,geom="boxplot")

It seems sulphates have a positive impact on quality.

alcohol

alcohol<-wineQualityReds$alcohol
summary(alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
qplot(alcohol,bins=30)

The data is right skewed.

qplot(log(alcohol),bins=30)

Log transformation does not work well here.

qplot(alcohol,data=wineQualityReds,geom="density",facets=.~quality)

It seems alcohol is distributed differently across quality levels. Perhaps that’s the reason why it is skewed.

qplot(factor(quality),alcohol,data=wineQualityReds,geom="boxplot")

In terms of medians, alcohol seems to have a positive impact on quality, which is consistent with the density plot above.

Univariate Analysis

What is the structure of your dataset?

The dataset consists of 1599 rows and 12 columns: 1599 red wines, 11 numeric features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”) and 1 integer outcome (quality). Quality ranges from 3 to 8 in this dataset (from worst to best). The mean quality is 5.636 and the median is 6; most wines have a quality between 5 and 7. Furthermore, the histograms show that fixed.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and alcohol are right skewed, and residual.sugar and chlorides both have outliers that influence their distributions to some extent.

What is/are the main feature(s) of interest in your dataset?

According to the boxplots above, volatile.acidity, citric.acid, density, pH, sulphates and alcohol are the main features, because they seem to have either a positive or a negative relationship with quality. Specifically, citric.acid, sulphates and alcohol seem to have a positive impact on quality, whereas volatile.acidity, density and pH seem to have a negative impact.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think fixed.acidity, chlorides, residual.sugar, total.sulfur.dioxide and bf.sulfur.dioxide can support the investigation of the main features; none of them has missing values.

Did you create any new variables from existing variables in the dataset?

I created a bf.sulfur.dioxide feature from total sulfur dioxide and free sulfur dioxide. The free and bound forms of sulfur dioxide may contribute differently to the quality of red wine. The log transformations of total.sulfur.dioxide and bf.sulfur.dioxide have better-shaped distributions.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Fixed.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and alcohol are right skewed in the histograms, and the density plots show that their distributions differ across quality levels. I used log transformation and outlier removal to make the distributions more balanced. With outliers removed, sulphates and chlorides tend to have better-shaped histograms. The log transformation also works for total.sulfur.dioxide and bf.sulfur.dioxide.

wineQualityReds2<-mutate(wineQualityReds,log.total.sulfur.dioxide=log(total.sulfur.dioxide),log.bf.sulfur.dioxide=log(bf.sulfur.dioxide))
wineQualityReds2<-select(wineQualityReds2,-total.sulfur.dioxide,-bf.sulfur.dioxide,-free.sulfur.dioxide)

We log-transformed total.sulfur.dioxide and bf.sulfur.dioxide, dropped the untransformed sulfur dioxide columns, and stored the result in wineQualityReds2.

Bivariate Plots Section

Relationship between main features

quality vs other features

Group the dataset by quality and calculate the median and variance of each feature.

wineQualityReds.group<-group_by(wineQualityReds2,quality)
s<-summarize(wineQualityReds.group,
             citric_md=median(citric.acid),
             sul_md=median(sulphates),
             alc_md=median(alcohol),
             vol_md=median(volatile.acidity),
             pH_md=median(pH),
             citric_var=var(citric.acid),
             sul_var=var(sulphates),
             alc_var=var(alcohol),
             vol_var=var(volatile.acidity),
             pH_var=var(pH))
s
## Source: local data frame [6 x 11]
## 
##   quality citric_md sul_md alc_md vol_md pH_md citric_var    sul_var
##     (int)     (dbl)  (dbl)  (dbl)  (dbl) (dbl)      (dbl)      (dbl)
## 1       3     0.035  0.545  9.925  0.845  3.39 0.06283222 0.01488889
## 2       4     0.090  0.560 10.000  0.670  3.37 0.04041321 0.05730806
## 3       5     0.230  0.580  9.700  0.580  3.30 0.03240095 0.02926229
## 4       6     0.260  0.640 10.500  0.490  3.32 0.03806730 0.02516967
## 5       7     0.400  0.740 11.500  0.370  3.28 0.03780388 0.01839791
## 6       8     0.420  0.740 12.150  0.370  3.23 0.03981046 0.01331242
## Variables not shown: alc_var (dbl), vol_var (dbl), pH_var (dbl)

Just as the boxplots above indicate, good red wines tend to contain more citric.acid, sulphates and alcohol, and to have lower volatile.acidity and pH.
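The grouped medians can also be computed in base R with aggregate, which may be handy when dplyr is not available. The sketch below runs on a few made-up rows (`toy` is hypothetical, not the wine data) so the medians can be verified by hand.

```r
# Hypothetical toy rows, for illustration only
toy <- data.frame(quality = c(5, 5, 5, 7, 7, 7),
                  alcohol = c(9.4, 9.8, 9.6, 11.2, 11.8, 11.5),
                  pH      = c(3.40, 3.30, 3.35, 3.20, 3.30, 3.25))

# One row per quality level, one column per summarized feature
med <- aggregate(cbind(alcohol, pH) ~ quality, data = toy, FUN = median)
print(med)
```

Replacing median with var in the FUN argument gives the per-group variances in the same layout.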

s1<-gather(s,feature,median,citric_md:pH_md)
qplot(quality,median,data=s1,facets=.~feature)+geom_line()

s2<-gather(s,feature,var,citric_var:pH_var)
qplot(quality,var,data=s2,facets=.~feature)+geom_line()

Let’s have a look at the relationships among the main features of interest.

pairs(volatile.acidity~citric.acid+pH+density+sulphates+alcohol,data=wineQualityReds2)

cor(select(wineQualityReds2,volatile.acidity,citric.acid,pH,density,sulphates,alcohol))
##                  volatile.acidity citric.acid         pH     density
## volatile.acidity       1.00000000  -0.5524957  0.2349373  0.02202623
## citric.acid           -0.55249568   1.0000000 -0.5419041  0.36494718
## pH                     0.23493729  -0.5419041  1.0000000 -0.34169933
## density                0.02202623   0.3649472 -0.3416993  1.00000000
## sulphates             -0.26098669   0.3127700 -0.1966476  0.14850641
## alcohol               -0.20228803   0.1099032  0.2056325 -0.49617977
##                    sulphates     alcohol
## volatile.acidity -0.26098669 -0.20228803
## citric.acid       0.31277004  0.10990325
## pH               -0.19664760  0.20563251
## density           0.14850641 -0.49617977
## sulphates         1.00000000  0.09359475
## alcohol           0.09359475  1.00000000

There appears to be a moderate negative linear relationship between volatile.acidity and citric.acid (about -0.55), and between citric.acid and pH (about -0.54).
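Picking the notable pairs out of a correlation matrix by eye gets harder as the number of features grows, so a small filter can help. The sketch below re-enters the matrix printed above, rounded to three decimals, and lists the pairs whose absolute correlation exceeds 0.5. The 0.5 cutoff is an assumption chosen for illustration, not a threshold used in the original analysis.

```r
vars <- c("volatile.acidity", "citric.acid", "pH",
          "density", "sulphates", "alcohol")

# Correlations copied from the cor() output above, rounded to 3 decimals
r <- matrix(c( 1.000, -0.552,  0.235,  0.022, -0.261, -0.202,
              -0.552,  1.000, -0.542,  0.365,  0.313,  0.110,
               0.235, -0.542,  1.000, -0.342, -0.197,  0.206,
               0.022,  0.365, -0.342,  1.000,  0.149, -0.496,
              -0.261,  0.313, -0.197,  0.149,  1.000,  0.094,
              -0.202,  0.110,  0.206, -0.496,  0.094,  1.000),
            nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) and apply the assumed 0.5 cutoff
idx <- which(abs(r) > 0.5 & upper.tri(r), arr.ind = TRUE)
strong <- data.frame(x = vars[idx[, "row"]],
                     y = vars[idx[, "col"]],
                     r = r[idx])
print(strong)
```

With this cutoff only two pairs remain: volatile.acidity with citric.acid, and citric.acid with pH. The density and alcohol pair (-0.496) falls just below it.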

volatile.acidity vs citric.acid

g<-ggplot(wineQualityReds2,aes(volatile.acidity,citric.acid))
g<-g+geom_point(alpha=1/10)+geom_smooth()
g

Let’s check the linear relationship.

g+geom_smooth(method="lm")

cor.test(volatile.acidity,citric.acid,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

According to the correlation value, volatile.acidity is negatively correlated with citric.acid.
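The t statistic in the cor.test output follows directly from the sample correlation and the degrees of freedom, via t = r * sqrt(df) / sqrt(1 - r^2). As a quick check, the sketch below plugs in the two values printed above.

```r
r  <- -0.5524957   # sample correlation reported by cor.test
df <- 1597         # n - 2, with n = 1599 wines

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(round(t_stat, 3))

# Two-sided p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)
```

The result agrees with the reported t of -26.489 up to rounding, and the p-value is far below the 2.2e-16 printed by cor.test.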

citric.acid vs pH

g<-ggplot(wineQualityReds2,aes(citric.acid,pH))
g+geom_jitter()+geom_smooth(method="lm")

cor.test(citric.acid,pH,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  citric.acid and pH
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5756337 -0.5063336
## sample estimates:
##        cor 
## -0.5419041

pH is negatively correlated with citric.acid.

Relationship between other features (not main features)

pairs(fixed.acidity~chlorides+residual.sugar+log.total.sulfur.dioxide+log.bf.sulfur.dioxide,data=wineQualityReds2)

cor(select(wineQualityReds2,fixed.acidity,chlorides,residual.sugar,log.total.sulfur.dioxide,log.bf.sulfur.dioxide))
##                          fixed.acidity  chlorides residual.sugar
## fixed.acidity               1.00000000 0.09370519     0.11477672
## chlorides                   0.09370519 1.00000000     0.05560954
## residual.sugar              0.11477672 0.05560954     1.00000000
## log.total.sulfur.dioxide   -0.11789982 0.06022193     0.14747141
## log.bf.sulfur.dioxide      -0.04820219 0.08794616     0.15160657
##                          log.total.sulfur.dioxide log.bf.sulfur.dioxide
## fixed.acidity                         -0.11789982           -0.04820219
## chlorides                              0.06022193            0.08794616
## residual.sugar                         0.14747141            0.15160657
## log.total.sulfur.dioxide               1.00000000            0.94150455
## log.bf.sulfur.dioxide                  0.94150455            1.00000000

It seems that log.total.sulfur.dioxide and log.bf.sulfur.dioxide have a very strong linear relationship (correlation of about 0.94). Therefore, in the later discussion, we only talk about log.total.sulfur.dioxide.

g<-ggplot(wineQualityReds2,aes(log.total.sulfur.dioxide,log.bf.sulfur.dioxide))
g+geom_point()+geom_smooth(method="lm")

Relationship between main features and other features

volatile.acidity vs others

No clear relationship between volatile.acidity and other features.

citric.acid vs others

qplot(fixed.acidity,citric.acid,geom=c("smooth","point"))

qplot(chlorides,citric.acid,data=wineQualityReds2,geom=c("smooth","point"))

qplot(residual.sugar,citric.acid,geom=c("smooth","point"))

qplot(log.total.sulfur.dioxide,citric.acid,data=wineQualityReds2,geom=c("smooth","point"))

It seems fixed.acidity has a positive relationship with citric.acid.

cor.test(citric.acid,fixed.acidity,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  citric.acid and fixed.acidity
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

The correlation of about 0.67 supports this observation.

pH vs others

qplot(fixed.acidity,pH,geom=c("smooth","point"))

qplot(chlorides,pH,geom=c("smooth","point"))

qplot(residual.sugar,pH,geom=c("smooth","point"))

qplot(log.total.sulfur.dioxide,pH,data=wineQualityReds2,geom=c("smooth","point"))

It seems fixed.acidity has a negative relationship with pH.

cor.test(pH,fixed.acidity,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pH and fixed.acidity
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

The correlation is about -0.68, so its magnitude is over 0.6.

sulphates vs others

No obvious relationship.

density vs others

qplot(fixed.acidity,density,geom=c("smooth","jitter"),alpha=I(1/10))

qplot(chlorides,density,geom=c("smooth","point"))

qplot(residual.sugar,density,geom=c("smooth","point"))

qplot(log.total.sulfur.dioxide,density,data=wineQualityReds2,geom=c("smooth","point"))

It seems fixed.acidity has a positive relationship with density.

cor.test(fixed.acidity,density,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

The correlation value is over 0.6.

alcohol vs others

No clear relationship.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Just as the boxplots above indicate, good red wines (high quality) tend to contain more citric.acid, sulphates and alcohol, and to have lower volatile.acidity and pH, in terms of medians. Alcohol has a relatively large variance within each quality level, partly because the features have not been normalized. From the scatterplots above, volatile.acidity decreases as citric.acid increases, and pH is negatively correlated with citric.acid. Besides, citric.acid and density grow as fixed.acidity increases, whereas pH falls; the magnitudes of these correlations are above 0.6, which indicates a fairly strong linear relationship.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

log(bf.sulfur.dioxide) has a very strong positive linear relationship with log(total.sulfur.dioxide). So we can remove one of them for further discussion.

What was the strongest relationship you found?

According to the correlations and plots, fixed.acidity has a strong relationship with citric.acid, density and pH; the magnitudes of their correlation coefficients are above 0.6. fixed.acidity can be removed because it is correlated with these other features, and bf.sulfur.dioxide can be removed as well.

Multivariate Plots Section

g<-ggplot(wineQualityReds2,aes(volatile.acidity,citric.acid))
g+geom_jitter()+geom_smooth(method="lm")+facet_grid(.~quality)

g<-ggplot(wineQualityReds2,aes(pH,citric.acid))
g+geom_jitter()+geom_smooth(method="lm")+facet_grid(.~quality)

It seems quality does not affect the trend between volatile.acidity and citric.acid, or between pH and citric.acid. Because citric.acid is correlated with volatile.acidity and pH, we do not consider citric.acid as an independent feature.

Cut some variables

As there are only a few high-quality or low-quality red wine samples, we can cut the qualities into three categories: “bad”, “good” and “excellent”.

wineQualityReds2<-mutate(wineQualityReds2,quality_join = cut(quality, 
breaks=c(2,4,6,8),labels=c("bad","good","excellent")),
alcohol_join=cut(alcohol,breaks=c(8,9.5,10.2,11.1,15),
                 labels=c("lower","low","high","higher")))
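To see how cut assigns the labels, note that its intervals are open on the left and closed on the right by default, so breaks of 2, 4, 6, 8 give the bins (2,4], (4,6] and (6,8]. The alcohol breaks 9.5, 10.2 and 11.1 are the 1st quartile, median and 3rd quartile reported by summary(alcohol) earlier. The sketch below applies the same breaks to a handful of illustrative values.

```r
# Every quality score that occurs in the dataset
quality <- 3:8
quality_join <- cut(quality, breaks = c(2, 4, 6, 8),
                    labels = c("bad", "good", "excellent"))
print(data.frame(quality, quality_join))

# Illustrative alcohol values: the minimum, a value exactly on a break, the maximum
alcohol <- c(8.4, 9.5, 14.9)
alcohol_join <- cut(alcohol, breaks = c(8, 9.5, 10.2, 11.1, 15),
                    labels = c("lower", "low", "high", "higher"))
print(data.frame(alcohol, alcohol_join))
```

Scores 3 and 4 become "bad", 5 and 6 "good", and 7 and 8 "excellent". A value that sits exactly on a break, such as an alcohol of 9.5, goes into the bin below it.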

qplot(volatile.acidity,data=wineQualityReds2,geom="density",color=quality_join)

qplot(sulphates,data=wineQualityReds2,geom="density",color=quality_join)

qplot(pH,data=wineQualityReds2,geom="density",color=quality_join)

pH distribution doesn’t vary much with quality.

Figure out which features can separate different qualities

# I() sets a constant transparency instead of mapping alpha to a legend
qplot(volatile.acidity,sulphates,data=wineQualityReds2, 
          geom="jitter",alpha=I(1/10), color=quality_join) 

qplot(pH,sulphates,data=wineQualityReds2, 
          geom="jitter",alpha=I(1/10), color=quality_join)

It seems good wines gather at lower volatile.acidity and higher sulphates or higher pH.

qplot(volatile.acidity,sulphates,data=wineQualityReds2,
          geom="jitter",alpha=I(1/10), color=quality_join,facets=alcohol_join~.) 

More alcohol, more sulphates and less volatile.acidity tend to make good wine.

qplot(volatile.acidity,sulphates*pH,data=wineQualityReds2, 
          geom="jitter",alpha=I(1/10), color=quality_join,facets=alcohol_join~.) 

qplot(volatile.acidity,sulphates*chlorides,data=wineQualityReds2, 
          geom="jitter",alpha=I(1/10), color=quality_join)

qplot(pH/sulphates,volatile.acidity,data=wineQualityReds2, 
          geom="jitter",alpha=I(1/10), color=quality_join)

These interactions do not improve the separation between the quality classes.

Classification Tree model

library(tree)
## Warning: package 'tree' was built under R version 3.2.5
tree.fit<-tree(quality_join~volatile.acidity+sulphates+alcohol+pH,
                data=wineQualityReds2)
plot(tree.fit)
text(tree.fit,cex=0.7)

summary(tree.fit)
## 
## Classification tree:
## tree(formula = quality_join ~ volatile.acidity + sulphates + 
##     alcohol + pH, data = wineQualityReds2)
## Number of terminal nodes:  10 
## Residual mean deviance:  0.8083 = 1284 / 1589 
## Misclassification error rate: 0.1538 = 246 / 1599

The training accuracy is about 85% (1 - 0.1538), and the features used are consistent with our previous analysis.
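The accuracy figure comes straight from the misclassification line of the summary. The small sketch below redoes that arithmetic with the two counts printed above; since the tree was fit and scored on the same 1599 rows, this is an accuracy on the training data.

```r
misclassified <- 246    # wines assigned to the wrong category
n <- 1599               # wines used to fit the tree

error_rate <- misclassified / n
accuracy <- 1 - error_rate
print(round(c(error = error_rate, accuracy = accuracy), 4))
```

A useful comparison point is the share of the largest category, which table(wineQualityReds2$quality_join) would show; a model is only informative to the extent that it beats that share.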

tree.fit<-tree(quality_join~.-fixed.acidity-quality-citric.acid,
               data=wineQualityReds2)
plot(tree.fit)
text(tree.fit,cex=0.7)

summary(tree.fit)
## 
## Classification tree:
## tree(formula = quality_join ~ . - fixed.acidity - quality - citric.acid, 
##     data = wineQualityReds2)
## Variables actually used in tree construction:
## [1] "alcohol"                  "sulphates"               
## [3] "log.total.sulfur.dioxide" "volatile.acidity"        
## [5] "pH"                      
## Number of terminal nodes:  11 
## Residual mean deviance:  0.7954 = 1263 / 1588 
## Misclassification error rate: 0.1538 = 246 / 1599

Adding the other features leaves the misclassification rate unchanged at 0.1538, so they do not contribute much to the classification.

Multivariate Analysis

We transformed the red wine quality problem into a classification problem. We cut the qualities into three categories, selected several strong main features based on the previous analysis, and explored how they jointly decide the quality of red wine, including through interactions. We also built a tree model to analyse the main features and other features, which reinforced our earlier analysis.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the figures and the tree model discussed above, volatile.acidity, sulphates and alcohol strengthen each other in deciding the quality of red wine; they are also the main features identified in the earlier univariate and bivariate analysis. pH did not seem to help much in classification. We also explored the influence of other features, such as chlorides and log.total.sulfur.dioxide, but they do not play important roles in the model.

Were there any interesting or surprising interactions between features?

I haven’t found any interesting interactions between features.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a classification tree model to explore the relationship between quality and the other features. Its strengths are that the model is easy to understand and explain from the tree plot, since quality is decided by splits on the other features. Besides, a tree model can make predictions on a new dataset very quickly. Its limitation is that a tree may be too flexible, which can cause overfitting on the training dataset; the accuracy might improve with a more suitable model such as a random forest.
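One common way to check for the overfitting mentioned here is to hold out part of the data and measure accuracy only on rows the model has not seen. The sketch below shows the split on a made-up data frame (`toy` is a hypothetical stand-in for wineQualityReds2, and the 70/30 ratio is an assumption); the tree calls appear only as comments because they need the tree package.

```r
set.seed(42)
# Hypothetical stand-in for wineQualityReds2: 100 rows with a class label
toy <- data.frame(alcohol = runif(100, 8, 15),
                  quality_join = factor(sample(c("bad", "good", "excellent"),
                                               100, replace = TRUE)))

# Randomly assign 70% of the rows to training and keep the rest for testing
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# With the tree package loaded, the model would be fit on `train` only and
# scored on `test`, for example:
#   fit  <- tree(quality_join ~ alcohol, data = train)
#   pred <- predict(fit, newdata = test, type = "class")
#   mean(pred == test$quality_join)   # held-out accuracy
```

A held-out accuracy well below the training accuracy would be a sign that the tree is overfitting.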


Final Plots and Summary

Plot One

Description One

The values of Citric Acid are right skewed.

Plot Two

Description Two

Some features are quite correlated with others, e.g., fixed.acidity has a fairly strong linear relationship with citric.acid.

Plot Three

Description Three

From the plot, it is clear that more alcohol, more sulphates and less volatile.acidity tend to make good wine.


Reflection

The wineQualityReds dataset has 1599 red wines whose qualities range from 3 to 8. I started by analyzing each feature in the dataset through histograms, boxplots and summaries, and then explored the relationship between the features and the outcome through scatterplots. Eventually, I transformed the problem into a classification one, explored the quality of red wine across many variables, and created a tree model to predict quality. I found that volatile.acidity, sulphates and alcohol have important impacts on quality, whereas other features such as fixed.acidity and pH do not play important roles in terms of accuracy. In addition, features such as fixed.acidity, citric.acid and free.sulfur.dioxide are quite dependent on other features, so I did not consider them in later discussion. For the tree model, I tried to find how the significant features decide quality, and in the end I kept only volatile.acidity, sulphates and alcohol as significant features. One limitation is that a tree model may be too flexible, which can cause overfitting on the training dataset; the accuracy might improve with a more suitable model such as a random forest. Considering that most red wines have qualities ranging from 5 to 7 and there are only a few samples of bad and top wines, the model might do better if the dataset were balanced.